{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f168e06c",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/vector_stores/PineconeIndexDemo-Hybrid.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "307804a3-c02b-4a57-ac0d-172c30ddc851",
   "metadata": {},
   "source": [
    "# Pinecone Vector Store - Hybrid Search"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f821db5",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4fe98b73",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-vector-stores-pinecone \"transformers[torch]\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7010b1d-d1bb-4f08-9309-a328bb4ea396",
   "metadata": {},
   "source": [
    "#### Creating a Pinecone Index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ce3143d-198c-4dd2-8e5a-c5cdf94f017a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pinecone import Pinecone, ServerlessSpec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ad14111-0bbb-4c62-906d-6d6253e0cdee",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"PINECONE_API_KEY\"] = \"...\"\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n",
    "\n",
    "api_key = os.environ[\"PINECONE_API_KEY\"]\n",
    "\n",
    "pc = Pinecone(api_key=api_key)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6123399c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# delete if needed\n",
    "pc.delete_index(\"quickstart\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2c90087-bdd9-4ca4-b06b-2af883559f88",
   "metadata": {},
   "outputs": [],
   "source": [
    "# dimensions are for text-embedding-ada-002\n",
    "# NOTE: needs dotproduct for hybrid search\n",
    "\n",
    "pc.create_index(\n",
    "    name=\"quickstart\",\n",
    "    dimension=1536,\n",
    "    metric=\"dotproduct\",\n",
    "    spec=ServerlessSpec(cloud=\"aws\", region=\"us-east-1\"),\n",
    ")\n",
    "\n",
    "# If you need to create a PodBased Pinecone index, you could alternatively do this:\n",
    "#\n",
    "# from pinecone import Pinecone, PodSpec\n",
    "#\n",
    "# pc = Pinecone(api_key='xxx')\n",
    "#\n",
    "# pc.create_index(\n",
    "# \t name='my-index',\n",
    "# \t dimension=1536,\n",
    "# \t metric='cosine',\n",
    "# \t spec=PodSpec(\n",
    "# \t\t environment='us-east1-gcp',\n",
    "# \t\t pod_type='p1.x1',\n",
    "# \t\t pods=1\n",
    "# \t )\n",
    "# )\n",
    "#"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "667f3cb3-ce18-48d5-b9aa-bfc1a1f0f0f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "pinecone_index = pc.Index(\"quickstart\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01c6bb69",
   "metadata": {},
   "source": [
    "Download Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "664f01b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/paul_graham/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ee4473a-094f-4d0a-a825-e1213db07240",
   "metadata": {},
   "source": [
    "#### Load documents, build the PineconeVectorStore\n",
    "\n",
    "When `add_sparse_vector=True`, the `PineconeVectorStore` will compute sparse vectors for each document.\n",
    "\n",
    "By default, it is using simple token frequency for the sparse vectors. But, you can also specify a custom sparse embedding model.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a2bcc07",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
    "from llama_index.vector_stores.pinecone import PineconeVectorStore\n",
    "from IPython.display import Markdown, display"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68cbd239-880e-41a3-98d8-dbb3fab55431",
   "metadata": {},
   "outputs": [],
   "source": [
    "# load documents\n",
    "documents = SimpleDirectoryReader(\"./data/paul_graham/\").load_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ba1558b3",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
      "To disable this warning, you can either:\n",
      "\t- Avoid using `tokenizers` before the fork if possible\n",
      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "bb2c43f2d48e4293b9df39ad3615080b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Upserted vectors:   0%|          | 0/22 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# set add_sparse_vector=True to compute sparse vectors during upsert\n",
    "from llama_index.core import StorageContext\n",
    "\n",
    "if \"OPENAI_API_KEY\" not in os.environ:\n",
    "    raise EnvironmentError(f\"Environment variable OPENAI_API_KEY is not set\")\n",
    "\n",
    "vector_store = PineconeVectorStore(\n",
    "    pinecone_index=pinecone_index,\n",
    "    add_sparse_vector=True,\n",
    ")\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents, storage_context=storage_context\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04304299-fc3e-40a0-8600-f50c3292767e",
   "metadata": {},
   "source": [
    "#### Query Index\n",
    "\n",
    "May need to wait a minute or two for the index to be ready"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35369eda",
   "metadata": {},
   "outputs": [],
   "source": [
    "# set Logging to DEBUG for more detailed outputs\n",
    "query_engine = index.as_query_engine(vector_store_query_mode=\"hybrid\")\n",
    "response = query_engine.query(\"What happened at Viaweb?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bedbb693-725f-478f-be26-fa7180ea38b2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<b>Paul Graham started Viaweb because he needed money. As the company grew, he realized he didn't want to run a big company and decided to build a subset of the vision as an open source project. Eventually, Viaweb was bought by Yahoo in the summer of 1998, which was a huge relief for Paul Graham.</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35ee5bcb",
   "metadata": {},
   "source": [
    "## Changing the sparse embedding model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46320213",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-sparse-embeddings-fastembed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d99347cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clear the vector store\n",
    "vector_store.clear()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4249825",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "185bdaff9cce4fd38cf3291a37df559d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from llama_index.sparse_embeddings.fastembed import FastEmbedSparseEmbedding\n",
    "\n",
    "sparse_embedding_model = FastEmbedSparseEmbedding(\n",
    "    model_name=\"prithivida/Splade_PP_en_v1\"\n",
    ")\n",
    "\n",
    "vector_store = PineconeVectorStore(\n",
    "    pinecone_index=pinecone_index,\n",
    "    add_sparse_vector=True,\n",
    "    sparse_embedding_model=sparse_embedding_model,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bdb539f9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "282f48ba565744c1a498725dbdec481f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Upserted vectors:   0%|          | 0/22 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "index = VectorStoreIndex.from_documents(\n",
    "    documents, storage_context=storage_context\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ab6250c",
   "metadata": {},
   "source": [
    "Wait a mininute for things to upload.."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70e127f6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<b>Paul Graham started Viaweb because he needed money. He recruited a team to work on building software and services, with a focus on creating an application builder and network infrastructure. However, halfway through the summer, Paul realized he didn't want to run a big company and decided to shift his focus to building a subset of the project as an open source project. This led to the development of a new dialect of Lisp called Arc. Ultimately, Viaweb was sold to Yahoo in the summer of 1998, providing relief to Paul Graham and allowing him to transition to a new phase in his life.</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = query_engine.query(\"What happened at Viaweb?\")\n",
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
